DOCUMENT RESUME



ID 1«7 3«0 



TB 906 814 



TITLE           Why Should All Those Students Take All Those Tests?
                (Every-Student Testing or Sampling of Selected
                Groups?)
INSTITUTION     National Education Association, Washington, D.C.
PUB DATE        May 75
NOTE            10p.; For related documents, see TM 006 6il6, ED 084
                641, 091 821, and 092 571; also available in ED 1il6
                233
AVAILABLE FROM  National Education Association, 1201 16th Street,
                N.W., Washington, D.C. 20036 (Free of Charge)
EDRS PRICE      MF-$0.83 Plus Postage. HC Not Available from EDRS.
DESCRIPTORS     *Cost Effectiveness; Educational Accountability;
                Elementary Secondary Education; Evaluation Methods;
                Group Tests; Item Sampling; *Sampling; School
                Districts; *Standardized Tests; State Programs;
                Student Testing; *Testing Problems; *Testing
                Programs
IDENTIFIERS     Alternatives to Standardized Testing; *National
                Education Association



ABSTRACT

The National Education Association's Task Force on
Testing has stated its opinion that standardized tests are overused.
The task force suggests that the application of sampling techniques
and a variety of alternatives to current testing practices would
accomplish the same purposes. Representatives of the testing industry
have indicated that the sampling of student populations could be
equally effective as the blanket testing of every student. Sampling
procedures would also assure the rights to privacy, and conserve
time, effort, and cost. Methods for determining whether or not
sampling should be used are presented, along with a brief discussion
of item sampling. (Author/BV)






Documents acquired by ERIC include many informal unpublished
materials not available from other sources. ERIC makes every effort
to obtain the best copy available. Nevertheless, items of marginal
reproducibility are often encountered and this affects the quality
of the microfiche and hardcopy reproductions ERIC makes available
via the ERIC Document Reproduction Service (EDRS). EDRS is not
responsible for the quality of the original document. Reproductions
supplied by EDRS are the best that can be made from the original.




U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



WHY SHOULD ALL THOSE STUDENTS TAKE ALL THOSE TESTS?



(EVERY-STUDENT TESTING OR SAMPLING OF SELECTED GROUPS?) 



Published by the 

NATIONAL EDUCATION ASSOCIATION
1201 16th Street, N. W., Washington, D. C. 20036 



May 1975 



WHY SHOULD ALL THOSE STUDENTS TAKE ALL THOSE TESTS? 
(EVERY-STUDENT TESTING OR SAMPLING OF SELECTED GROUPS?) 



The NEA Task Force on Testing, in its first interim report, states:



    The Task Force believes there is overkill in the use
    of standardized tests and that the intended purposes
    of testing can be accomplished through less use of
    standardized tests, through sampling techniques where
    tests are used, and through a variety of alternatives
    to tests. . . .

    Representatives of the testing industry and others told
    the Task Force that sampling of student populations
    could be as effective as the blanket application of tests
    that is now so common. Some suggested that such procedures,
    in addition to increasing the assurance of privacy
    rights, would conserve time, effort, and financial
    expenditure.^1

The blanket use of tests (every-pupil testing) in some state assessment
and local testing programs appears to require inordinate amounts of time and
resources on the part of teachers, other personnel involved in test
administration and interpretation, and the students themselves.

Criticisms of the blanket use of tests have come from a variety of 
prominent researchers, evaluators, and other educators. 

House, Rivers, and Stufflebeam, in their evaluation of the Michigan
accountability system, concurred that in that state:



    Statewide testing as presently executed also raises the
    question of the feasibility of every pupil testing. This
    practice appears to be of dubious value when the cost of
    such an undertaking is compared with the resulting benefits
    to local level personnel.... The local, and hence overall,
    costs could be reduced by a matrix sampling plan which
    requires that each student tested take only a few items.
    In the long run, a matrix sampling plan will be the only
    one feasible from a cost and time standpoint. The cost
    and time required for every pupil testing for the whole
    state would be horrendous.... We feel that it [strict
    adherence to a statewide testing model] will result in
    useless expenditures of monies and manpower, in addition
    to producing unwarranted disruptions of the educational
    programs within a great number of schools.^2

^1 In Task Force and Other Reports, presented to the Fifty-Second Representative
Assembly of the National Education Association, July 3-6, 1973, Portland,
Oregon, pp. 26-46.


In a paper entitled "Criteria for Evaluating State Education
Accountability Systems," the National Education Association has laid down fifteen
basic principles, one of which is as follows:



    If the state desires test data for its own planning
    purposes, it should use proven matrix sampling techniques
    which will not reveal schools and which will greatly
    reduce costs.

    Matrix sampling techniques can give an accurate picture
    of the state by various categories much more efficiently
    than testing each child with an entire instrument.^3



It was with such admonitions as these in mind that this paper was
developed. And while some procedures are appropriate for evaluating all
students in one way or another for particular purposes, it would appear
that there is gross over-use of blanket testing procedures.

To help teachers and other educators better understand some main
considerations related to sampling, the NEA obtained permission from Dr. Frank
Womer, Michigan School Testing Service, University of Michigan, to reproduce
material from a monograph of his on developing assessment programs.^4 In
addition, Dr. Womer prepared, especially for this paper, a section on item
sampling. Dr. Womer's recommendations follow.

^2 House, Ernest R.; Rivers, Wendell; and Stufflebeam, Dan. An Assessment of
the Michigan Accountability System. Michigan Education Association and
National Education Association, March 1974. pp. 14-16.

^3 National Education Association. "Criteria for Evaluating State Education
Accountability Systems." Washington, D.C.: the Association, n.d.

**************************** 

Determining Whether Sampling Is To Be Used 

The decision whether to test an entire population or use a sample
involves a combination of concerns. Clearly there are policy
considerations; clearly there are psychometric* considerations; clearly there are
data collection considerations; and clearly there are cost considerations.
The best possible staff and consultant thinking on this question should be
brought to an advisory committee for them to consider very carefully.

Probably the most crucial consideration is a policy one, since
psychometrics, data collection, and cost generally would argue on the side of
sampling rather than using an entire population. If it is deemed wise for
policy reasons to test all students in a population, that preference, typically,
will have to be weighed against available resources and technology; so we will
consider first the policy implications of the two choices.

One needs to look carefully at the purposes and goals of a specific
assessment program in determining whether sampling is appropriate. If all
of the specific purposes and objectives of an assessment program can be met
by group results, then sampling must be considered.

^4 Womer, Frank B. Developing a Large-Scale Assessment Program. Denver:
Cooperative Accountability Project, 1973.

* Editor's note: Psychometrics in the strictest sense of the definition
has to do with the measurement of mental abilities. It has come to be
used much more broadly to define a wide range of activities in assessment
and evaluation.



The only assessment situation that clearly calls for common data
collection on all members of the population is when it is deemed essential,
for improved decision making, to have exactly the same test information for
every pupil in a given grade in a state (or other assessment unit). It is
exactly this situation that has prevailed for years in local school districts
that have every-pupil achievement or ability testing at some grade level.
Historically, the compulsory state testing programs were examples of this
situation; the voluntary programs were not. If a state mandates common
testing for all students it is taking over a role that local districts
traditionally have held. This may be good or this may be bad depending
on one's point of view of the role of a state department of education. It
certainly has important policy implications.

There are many facets to this point, but it should be kept clearly
in mind that it is not necessary to test every pupil at a given grade
level on identical material in order to get a good picture of educational
outcomes of groups of students; it is necessary only if one feels that
each teacher in an entire state at a given grade level must have the same
information for each pupil.

Probably the greatest advantage of sampling is that for a given
amount of effort (and money) one can gather more usable information than
by using an entire population. If the goals of an assessment program are
to gather statewide information only, it is hard to conceive of any reason
for testing all students in a given grade. For example, if there are 50,000
third-graders in the state of Limbo, and one wants to gather state statistics
only, it is very possible that a sample of 5,000 students (or even 500) would
be sufficient if they are selected by a probability sample.* Or, if
one can afford to test all 50,000 third-graders, and if it is deemed wise
to do so, one could select ten 5,000-pupil samples and secure information on
ten subject areas, or one could go into great depth of information gathering
in two or three subject areas. The combinations of possibilities of sampling
pupils and content are almost endless.
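
The Limbo example can be illustrated with a small simulation. What follows is only a sketch under stated assumptions: the score distribution (mean 50, standard deviation 10) is invented for illustration, the helper name srs_estimate is hypothetical, and a simple random sample stands in for whatever probability design a sampling statistician would actually recommend.

```python
import math
import random
import statistics

def srs_estimate(population, n, rng):
    """Estimate the population mean from a simple random sample of size n."""
    sample = rng.sample(population, n)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)
    # Standard error of the mean, with the finite population correction.
    fpc = math.sqrt((len(population) - n) / (len(population) - 1))
    return mean, sd / math.sqrt(n) * fpc

rng = random.Random(1975)
# Hypothetical scores for 50,000 third-graders (made-up distribution).
population = [rng.gauss(50, 10) for _ in range(50_000)]
true_mean = statistics.fmean(population)

# Compare a 5,000-pupil sample and a 500-pupil sample with the full census.
results = {n: srs_estimate(population, n, rng) for n in (5_000, 500)}
for n, (mean, se) in results.items():
    print(f"n={n:>5}: estimate {mean:.2f} +/- {1.96 * se:.2f} "
          f"(census mean {true_mean:.2f})")
```

Under these assumptions the smaller sample gives a wider margin of error, but both estimates should land close to the census mean, which is the sense in which 5,000 pupils (or even 500) may be sufficient for state statistics.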

If one wants district-level information, then sampling becomes a
different situation. In a school district with one third grade, sampling
of pupils is hardly possible for most assessment purposes. In school districts
with many third-graders, sampling could provide a greater variety of
information than common testing on every pupil, in the same fashion as at the state
level. Specific decisions of how far to carry sampling should be made only
after advice from a sampling statistician. Sampling is a highly developed
technical field, and the implications of any decisions to sample or not to
sample must be reviewed by competent samplers.

Other "compromise" possibilities exist. One could test all students in
a population with one short test, while using a sampling approach for other
tests. This approach would provide some common information on all students but
would allow a greater depth of data collection over a subject area.


    Principle: Sampling of pupils and/or content should be
    given very serious consideration for all large-scale
    assessment projects. The only situation where it may not
    be useful is one where it is deemed essential to collect
    common information on all students in a statewide population
    of students. Sampling should be used to maximize
    the collection of usable information for stated
    assessment purposes at the lowest possible cost and
    effort.

* Editor's note: For information on probability samples, see Womer, op. cit.



    Sampling with total tests is less complicated
    to administer, but since it is likely to be subject
    to error in administration and consequently
    less reliable, in some cases item sampling may be
    more useful. Therefore, Dr. Womer was asked to
    prepare an additional statement on the purposes
    and potential of item sampling. His statement
    follows.



Item Sampling 

The process of item sampling in testing is most useful for one of
two purposes:

1. to increase the amount of group test results that can be
   obtained from students in a given period of time; or
2. to decrease the amount of testing time necessary to obtain
   large amounts of group test information from students.

For either purpose, it is essential to keep in mind that item sampling
is useful for gathering information about groups of students. Thus it is
a technique for use with relatively large groups, not a classroom-sized
group or even three or four classes within a building.

Example 1

A school system has 500 students in the sixth grade. A standardized
reading test is to be administered for a one-shot systemwide
survey. The test takes 45 minutes to administer, which is all
the time that can be taken from a busy schedule at the end of
the year.

Staff are unhappy that only reading is to be surveyed. Some
major changes were made in the mathematics curriculum three
years before and they feel it would be valuable to survey
this subject also. By randomly selecting only 250 of the
students to take the reading test, the other 250 could be
given a 45-minute mathematics test at the same time.
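
The random assignment described in Example 1 could be carried out as in the sketch below. The function name split_for_two_tests and the ID numbers 1 through 500 are assumptions made for illustration; any roster of the 500 sixth-graders would serve.

```python
import random

def split_for_two_tests(student_ids, seed=None):
    """Randomly assign half of the students to each of two tests."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)  # every ordering of the roster is equally likely
    half = len(shuffled) // 2
    return {"reading": shuffled[:half], "mathematics": shuffled[half:]}

students = range(1, 501)  # hypothetical ID numbers for 500 sixth-graders
groups = split_for_two_tests(students, seed=6)
print(len(groups["reading"]), len(groups["mathematics"]))  # prints "250 250"
```

Each student sits for exactly one 45-minute test, so the schedule is unchanged while the system obtains group results in both subjects.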

Example 2 

A school system has 1,000 fourth-graders. It is desired to do
an in-depth study of student outcomes for 100 different behavioral
objectives in mathematics. Each objective requires the use of
eight questions. The total of 800 questions would require one
student to spend perhaps 15 hours of testing time to attempt
all of them.

By randomly dividing up the objectives and items into five
different subtests (each with 20 objectives and 160 items),
each subtest could be administered to 200 students (randomly
selected). This would require only 3 hours of testing time
per student (manageable) rather than 15 hours (unmanageable),
and group results would still be available for all 100
objectives (800 items).
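
The arithmetic and the random assignment in Example 2 can be sketched as follows. The constant names and the chunk helper are hypothetical, and the 15-hour figure is simply the text's rough estimate, spread evenly over the 800 items.

```python
import random

N_OBJECTIVES, ITEMS_PER_OBJECTIVE = 100, 8
N_STUDENTS, N_SUBTESTS = 1_000, 5
HOURS_FOR_ALL_ITEMS = 15  # the text's rough estimate for all 800 items

def chunk(seq, n_chunks):
    """Split seq into n_chunks equal consecutive parts (assumes even division)."""
    size = len(seq) // n_chunks
    return [seq[i * size:(i + 1) * size] for i in range(n_chunks)]

rng = random.Random(4)
objectives = list(range(N_OBJECTIVES))
students = list(range(N_STUDENTS))
rng.shuffle(objectives)  # random assignment of objectives to subtests
rng.shuffle(students)    # random assignment of students to subtests

plan = [
    {"objectives": objs,
     "n_items": len(objs) * ITEMS_PER_OBJECTIVE,
     "students": grp}
    for objs, grp in zip(chunk(objectives, N_SUBTESTS), chunk(students, N_SUBTESTS))
]
total_items = N_OBJECTIVES * ITEMS_PER_OBJECTIVE
hours_per_student = HOURS_FOR_ALL_ITEMS * plan[0]["n_items"] / total_items
print(len(plan), plan[0]["n_items"], len(plan[0]["students"]), hours_per_student)
# prints "5 160 200 3.0"
```

Every objective still appears on exactly one subtest, so group results remain available for all 100 objectives even though no student attempts more than 160 items.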

In either example the results will be usable for group analyses. Any
slight reduction in accuracy due to sampling error is apt to be much less
than errors due to increasing testing time of students beyond some reasonable
amount. Systematic errors due to fatigue, disinterest, poor motivation,
teacher concern, and other conditions of testing can easily outweigh a
small sampling error.






